EDA

This section introduces the Exploratory Data Analysis component of DataPrep.

Introduction to Exploratory Data Analysis and dataprep.eda

Exploratory Data Analysis (EDA) is the process of exploring a dataset and getting an understanding of its main characteristics. The dataprep.eda package simplifies this process by allowing the user to explore important characteristics with simple APIs. Each API allows the user to analyze the dataset from a high level to a low level, and from different perspectives. Specifically, dataprep.eda provides the following functionality:

  • Analyze column distributions with plot(). The function plot() explores the column distributions and statistics of the dataset. It will detect the column type, and then output various plots and statistics that are appropriate for the respective type. The user can optionally pass one or two columns of interest as parameters: If one column is passed, its distribution will be plotted in various ways, and column statistics will be computed. If two columns are passed, plots depicting the relationship between the two columns will be generated.

  • Analyze correlations with plot_correlation(). The function plot_correlation() explores the correlation between columns in various ways and using multiple correlation metrics. By default, it plots correlation matrices with various metrics. The user can optionally pass one or two columns of interest as parameters: If one column is passed, the correlation between this column and all other columns will be computed and ranked. If two columns are passed, a scatter plot and regression line will be plotted.

  • Analyze missing values with plot_missing(). The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. By default, it will generate various plots which display the amount of missing values for each column and any underlying patterns of the missing values in the dataset. To understand the impact of the missing values in one column on the other columns, the user can pass the column name as a parameter. Then, plot_missing() will generate the distribution of each column with and without the missing values from the given column, enabling a thorough understanding of their impact.

The following sections give a simple demonstration of plot(), plot_correlation(), and plot_missing(), using an example dataset.

Analyze distributions with plot()

The function plot() explores the distributions and statistics of the dataset. The following describes the functionality of plot() for a given dataframe df.

  1. plot(df): plots the distribution of each column and calculates dataset statistics

  2. plot(df, x): plots the distribution of column x in various ways and calculates column statistics

  3. plot(df, x, y): generates plots depicting the relationship between columns x and y

The following shows an example of plot(df). It plots a histogram for each numerical column, a bar chart for each categorical column, and computes dataset statistics.

[1]:
from dataprep.eda import plot
from dataprep.datasets import load_dataset

# Load the bundled house-prices training set and plot every column
df = load_dataset('house_prices_train')
plot(df)
[1]:
DataPrep.EDA Report

Dataset Statistics

Number of Variables 81
Number of Rows 1460
Missing Cells 348
Missing Cells (%) 0.3%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 4.1 MB
Average Row Size in Memory 2.8 KB
Variable Types
  • Numerical: 38
  • Categorical: 43

Dataset Insights

Id is uniformly distributed Uniform
BsmtFinSF2 and EnclosedPorch have similar distributions Similar Distribution
LowQualFinSF and BsmtHalfBath have similar distributions Similar Distribution
LowQualFinSF and 3SsnPorch have similar distributions Similar Distribution
LowQualFinSF and PoolArea have similar distributions Similar Distribution
LowQualFinSF and MiscVal have similar distributions Similar Distribution
BsmtFullBath and HalfBath have similar distributions Similar Distribution
BsmtHalfBath and 3SsnPorch have similar distributions Similar Distribution
BsmtHalfBath and PoolArea have similar distributions Similar Distribution
BsmtHalfBath and MiscVal have similar distributions Similar Distribution

EnclosedPorch and ScreenPorch have similar distributions Similar Distribution
3SsnPorch and PoolArea have similar distributions Similar Distribution
3SsnPorch and MiscVal have similar distributions Similar Distribution
ScreenPorch and MiscVal have similar distributions Similar Distribution
PoolArea and MiscVal have similar distributions Similar Distribution
LotFrontage has 259 (17.74%) missing values Missing
GarageYrBlt has 81 (5.55%) missing values Missing
MSSubClass is skewed Skewed
LotFrontage is skewed Skewed
LotArea is skewed Skewed

OverallQual is skewed Skewed
OverallCond is skewed Skewed
YearBuilt is skewed Skewed
YearRemodAdd is skewed Skewed
MasVnrArea is skewed Skewed
BsmtFinSF1 is skewed Skewed
BsmtFinSF2 is skewed Skewed
TotalBsmtSF is skewed Skewed
2ndFlrSF is skewed Skewed
LowQualFinSF is skewed Skewed

BsmtFullBath is skewed Skewed
BsmtHalfBath is skewed Skewed
FullBath is skewed Skewed
HalfBath is skewed Skewed
BedroomAbvGr is skewed Skewed
KitchenAbvGr is skewed Skewed
TotRmsAbvGrd is skewed Skewed
Fireplaces is skewed Skewed
GarageCars is skewed Skewed
WoodDeckSF is skewed Skewed

OpenPorchSF is skewed Skewed
EnclosedPorch is skewed Skewed
3SsnPorch is skewed Skewed
ScreenPorch is skewed Skewed
PoolArea is skewed Skewed
MiscVal is skewed Skewed
MoSold is skewed Skewed
YrSold is skewed Skewed
LotFrontage has 259 (17.74%) infinite values Infinity
GarageYrBlt has 81 (5.55%) infinite values Infinity

Street has constant length 4 Constant Length
LotShape has constant length 3 Constant Length
LandContour has constant length 3 Constant Length
Utilities has constant length 6 Constant Length
LandSlope has constant length 3 Constant Length
ExterQual has constant length 2 Constant Length
ExterCond has constant length 2 Constant Length
BsmtFinType1 has constant length 3 Constant Length
BsmtFinType2 has constant length 3 Constant Length
HeatingQC has constant length 2 Constant Length

CentralAir has constant length 1 Constant Length
KitchenQual has constant length 2 Constant Length
GarageFinish has constant length 3 Constant Length
PavedDrive has constant length 1 Constant Length
MasVnrArea has 861 (58.97%) zeros Zeros
BsmtFinSF1 has 467 (31.99%) zeros Zeros
BsmtFinSF2 has 1293 (88.56%) zeros Zeros
BsmtUnfSF has 118 (8.08%) zeros Zeros
2ndFlrSF has 829 (56.78%) zeros Zeros
LowQualFinSF has 1434 (98.22%) zeros Zeros

BsmtFullBath has 856 (58.63%) zeros Zeros
BsmtHalfBath has 1378 (94.38%) zeros Zeros
HalfBath has 913 (62.53%) zeros Zeros
Fireplaces has 690 (47.26%) zeros Zeros
GarageCars has 81 (5.55%) zeros Zeros
GarageArea has 81 (5.55%) zeros Zeros
WoodDeckSF has 761 (52.12%) zeros Zeros
OpenPorchSF has 656 (44.93%) zeros Zeros
EnclosedPorch has 1252 (85.75%) zeros Zeros
3SsnPorch has 1436 (98.36%) zeros Zeros

ScreenPorch has 1344 (92.05%) zeros Zeros
PoolArea has 1453 (99.52%) zeros Zeros
MiscVal has 1408 (96.44%) zeros Zeros

For more information about the function plot() see here.
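The single-column and two-column forms listed earlier can be sketched as follows. This is a minimal illustration, not part of the library: the helper name explore_columns is hypothetical, and only the calls plot(df, x) and plot(df, x, y) come from the description in this section.

```python
def explore_columns(df, x, y=None):
    """Plot one column's distribution, or the relationship between two columns.

    Hypothetical convenience wrapper around dataprep.eda.plot.
    """
    # Imported lazily so this sketch can be loaded without dataprep installed.
    from dataprep.eda import plot

    if y is None:
        # One column: distribution plots and column statistics for x
        return plot(df, x)
    # Two columns: plots depicting the relationship between x and y
    return plot(df, x, y)
```

With the house-prices dataframe loaded above, explore_columns(df, "LotArea") would plot the distribution of LotArea, and explore_columns(df, "YearBuilt", "LotArea") would plot the relationship between the two columns. Both column names are taken from the insights output shown earlier.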

Analyze correlations with plot_correlation()

The function plot_correlation() explores the correlation between columns in various ways and using multiple correlation metrics. The following describes the functionality of plot_correlation() for a given dataframe df.

  1. plot_correlation(df): plots correlation matrices (correlations between all pairs of columns)

  2. plot_correlation(df, x): plots the most correlated columns to column x

  3. plot_correlation(df, x, y): plots the joint distribution of column x and column y and computes a regression line

The following shows an example of plot_correlation(). It generates correlation matrices using the Pearson, Spearman, and KendallTau correlation coefficients:

[2]:
from dataprep.eda import plot_correlation
from dataprep.datasets import load_dataset
df = load_dataset("wine-quality-red")
plot_correlation(df)
[2]:
DataPrep.EDA Report
                              Pearson   Spearman   KendallTau
Highest Positive Correlation  0.672     0.79       0.607
Highest Negative Correlation  -0.683    -0.707     -0.528
Lowest Correlation            0.002     0.001      0.0
Mean Correlation              0.019     0.028      0.021

Pearson
  • Most positive correlated: (fixed_acidit..., citric_acid)
  • Most negative correlated: (fixed_acidit..., pH)
  • Least correlated: (volatile_aci..., residual_sug...)

Spearman
  • Most positive correlated: (free_sulfur_..., total_sulfur...)
  • Most negative correlated: (fixed_acidit..., pH)
  • Least correlated: None

KendallTau
  • Most positive correlated: (free_sulfur_..., total_sulfur...)
  • Most negative correlated: (fixed_acidit..., pH)
  • Least correlated: None

For more information about the function plot_correlation() see here.
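To make the three metrics concrete, here is a small pure-Python sketch of how Pearson, Spearman, and Kendall coefficients can be computed for two columns. It is an illustration of the textbook definitions only: the function names are made up for this example, the Kendall variant shown is tau-a without tie correction, and DataPrep's own implementation may handle ties and missing values differently.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations.

    Raises ZeroDivisionError if either input is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# A monotonic but non-linear relationship: the rank-based metrics reach 1,
# while Pearson stays slightly below 1.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]
print(pearson(xs, ys), spearman(xs, ys), kendall_tau(xs, ys))
```

The difference between the metrics on this toy input is one reason a report shows all three side by side: Pearson measures linear association, whereas Spearman and Kendall only depend on the ordering of the values.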

Analyze missing values with plot_missing()

The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. The following describes the functionality of plot_missing() for a given dataframe df.

  1. plot_missing(df): plots the amount and position of missing values, and how missingness relates across columns

  2. plot_missing(df, x): plots the impact of the missing values in column x on all other columns

  3. plot_missing(df, x, y): plots the impact of the missing values from column x on column y in various ways

[3]:
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
plot_missing(df)
[3]:
DataPrep.EDA Report

Missing Statistics

Missing Cells 866
Missing Cells (%) 8.1%
Missing Columns 3
Missing Rows 708
Avg Missing Cells per Column 72.17
Avg Missing Cells per Row 0.97

For more information about the function plot_missing() see here.
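The quantities in the Missing Statistics panel can be reproduced on a toy table with a few lines of plain Python. This is a sketch under stated assumptions: the function name missing_statistics and the sample rows are invented for illustration, None stands in for a missing value, and the definitions are inferred from the labels in the output rather than taken from DataPrep's source. The averages reported above are consistent with these definitions if the table has 12 columns (866 / 12 ≈ 72.17), although the dataset's shape is not stated here.

```python
def missing_statistics(rows, columns):
    """Summarize missing cells in a list of row dicts; None marks a missing value."""
    per_column = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    per_row = [sum(1 for c in columns if r.get(c) is None) for r in rows]
    total_cells = len(rows) * len(columns)
    missing_cells = sum(per_column.values())
    return {
        "missing_cells": missing_cells,
        "missing_cells_pct": 100 * missing_cells / total_cells,
        # Columns and rows that contain at least one missing value
        "missing_columns": sum(1 for n in per_column.values() if n > 0),
        "missing_rows": sum(1 for n in per_row if n > 0),
        "avg_missing_per_column": missing_cells / len(columns),
        "avg_missing_per_row": missing_cells / len(rows),
    }


# Invented sample data, loosely modelled on a passenger table
columns = ["age", "cabin", "port"]
rows = [
    {"age": 22, "cabin": None, "port": "S"},
    {"age": None, "cabin": None, "port": "C"},
    {"age": 35, "cabin": "C85", "port": "S"},
    {"age": 54, "cabin": None, "port": "Q"},
]
stats = missing_statistics(rows, columns)
print(stats)
```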

Create a profile report with create_report()

The function create_report() generates a comprehensive profile report of the dataset. create_report() combines the individual components of the dataprep.eda package and outputs them into a nicely formatted HTML document. The document contains the following information:

  1. Overview: detect the types of columns in a dataframe

  2. Variables: variable type, unique values, distinct count, missing values

  3. Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range

  4. Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

  5. Text analysis for length, sample and letter

  6. Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices

  7. Missing Values: bar chart, heatmap and spectrum of missing values

An example report can be downloaded here.
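A report for a dataframe df might be generated and written to disk along the following lines. Treat this as a sketch rather than a reference: the helper name build_report and the default file name are hypothetical, and the save() method on the returned report object is assumed here, so its exact signature should be checked against the installed version of DataPrep.

```python
def build_report(df, filename="report"):
    """Create a profile report for df and save it as an HTML document.

    Hypothetical wrapper around dataprep.eda.create_report.
    """
    # Imported lazily so this sketch can be loaded without dataprep installed.
    from dataprep.eda import create_report

    report = create_report(df)
    # Assumption: the report object exposes save(), which writes the HTML file
    report.save(filename)
    return report
```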

Specifying colors

The supported colors of DataPrep.EDA match those of the Bokeh library. Color values can be provided in any of the following ways:

  • any of the 147 named CSS colors, e.g., ‘green’, ‘indigo’

  • an RGB(A) hex value, e.g., ‘#FF0000’, ‘#44444444’

  • a 3-tuple of integers (r,g,b) between 0 and 255

  • a 4-tuple of (r,g,b,a) where r, g, b are integers between 0 and 255 and a is a floating point value between 0 and 1
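The four accepted forms can be told apart with a small check like the one below. It is an illustrative sketch, not DataPrep or Bokeh code: the function name classify_color is invented, the set of named colors is a tiny subset of the 147 CSS names, and only 6- and 8-digit hex strings are recognized because those are the forms shown in the examples above.

```python
import re

# Matches '#RRGGBB' or '#RRGGBBAA'
_HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{8})")

# Illustrative subset only; the full list has 147 named CSS colors.
_SOME_NAMED_COLORS = {"green", "indigo", "red", "blue", "black", "white"}


def _is_number(value, types):
    """True for values of the given numeric types, excluding booleans."""
    return isinstance(value, types) and not isinstance(value, bool)


def classify_color(value):
    """Return 'named', 'hex', 'rgb', or 'rgba' for a color value, else None."""
    if isinstance(value, str):
        if _HEX_COLOR.fullmatch(value):
            return "hex"
        if value.lower() in _SOME_NAMED_COLORS:
            return "named"
        return None
    if isinstance(value, tuple) and len(value) in (3, 4):
        # r, g, b must be integers between 0 and 255
        if not all(_is_number(c, int) and 0 <= c <= 255 for c in value[:3]):
            return None
        if len(value) == 3:
            return "rgb"
        # a must be a number between 0 and 1
        alpha = value[3]
        if _is_number(alpha, (int, float)) and 0 <= alpha <= 1:
            return "rgba"
    return None


for candidate in ["green", "#FF0000", "#44444444", (255, 0, 0), (255, 0, 0, 0.5)]:
    print(candidate, "->", classify_color(candidate))
```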